OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

The Book Structure Extraction Competition with the Resurgence software for part and chapter detection at Caen University

Internal identifier: 000328 (Main/Exploration); previous: 000327; next: 000329

Authors: Emmanuel Giguet [France]; Nadine Lucas [France]

Source:

RBID : Hal:hal-01069909

Abstract

The GREYC Island team participated in the Structure Extraction Competition, part of the INEX Book track, for the second time, using the Resurgence software. We used a minimal strategy based primarily on a top-down document representation with two levels, part and chapter. The main idea is to use a model describing the relationships between elements of the document structure. Boundaries between high-level units are detected, first parts and then chapters. The page level is also used. The periphery-center relationship is calculated on the entire document and reflected on each page. The strengths of the approach are that it deals with the entire document; it handles books without ToCs, and titles that are not represented in the ToC (e.g. preface); it does not depend on a lexicon, and is hence tolerant of OCR errors and language-independent; and it is simple and fast.

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">The Book Structure Extraction Competition with the Resurgence software for part and chapter detection at Caen University</title>
<author>
<name sortKey="Giguet, Emmanuel" sort="Giguet, Emmanuel" uniqKey="Giguet E" first="Emmanuel" last="Giguet">Emmanuel Giguet</name>
<affiliation wicri:level="1">
<hal:affiliation type="researchteam" xml:id="struct-388300" status="VALID">
<orgName>Equipe Hultech - Laboratoire GREYC - UMR6072</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-150" type="direct"></relation>
<relation name="UMR6072" active="#struct-441569" type="indirect"></relation>
<relation active="#struct-300358" type="indirect"></relation>
<relation active="#struct-300266" type="indirect"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-150" type="direct">
<org type="laboratory" xml:id="struct-150" status="VALID">
<orgName>Groupe de Recherche en Informatique, Image, Automatique et Instrumentation de Caen</orgName>
<orgName type="acronym">GREYC</orgName>
<desc>
<address>
<addrLine>Boulevard du Maréchal Juin - 14050 CAEN Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.greyc.fr</ref>
</desc>
<listRelation>
<relation name="UMR6072" active="#struct-441569" type="direct"></relation>
<relation active="#struct-300358" type="direct"></relation>
<relation active="#struct-300266" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="UMR6072" active="#struct-441569" type="indirect">
<org type="institution" xml:id="struct-441569" status="VALID">
<idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300358" type="indirect">
<org type="institution" xml:id="struct-300358" status="VALID">
<orgName>Ecole Nationale Supérieure d'Ingénieurs de Caen</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300266" type="indirect">
<org type="institution" xml:id="struct-300266" status="INCOMING">
<orgName>Université de Caen Basse-Normandie</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Caen</settlement>
<region type="region" nuts="2">Basse-Normandie</region>
</placeName>
<orgName type="university">Université de Caen Basse-Normandie</orgName>
</affiliation>
</author>
<author>
<name sortKey="Lucas, Nadine" sort="Lucas, Nadine" uniqKey="Lucas N" first="Nadine" last="Lucas">Nadine Lucas</name>
<affiliation wicri:level="1">
<hal:affiliation type="researchteam" xml:id="struct-388300" status="VALID">
<orgName>Equipe Hultech - Laboratoire GREYC - UMR6072</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-150" type="direct"></relation>
<relation name="UMR6072" active="#struct-441569" type="indirect"></relation>
<relation active="#struct-300358" type="indirect"></relation>
<relation active="#struct-300266" type="indirect"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-150" type="direct">
<org type="laboratory" xml:id="struct-150" status="VALID">
<orgName>Groupe de Recherche en Informatique, Image, Automatique et Instrumentation de Caen</orgName>
<orgName type="acronym">GREYC</orgName>
<desc>
<address>
<addrLine>Boulevard du Maréchal Juin - 14050 CAEN Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.greyc.fr</ref>
</desc>
<listRelation>
<relation name="UMR6072" active="#struct-441569" type="direct"></relation>
<relation active="#struct-300358" type="direct"></relation>
<relation active="#struct-300266" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="UMR6072" active="#struct-441569" type="indirect">
<org type="institution" xml:id="struct-441569" status="VALID">
<idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300358" type="indirect">
<org type="institution" xml:id="struct-300358" status="VALID">
<orgName>Ecole Nationale Supérieure d'Ingénieurs de Caen</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300266" type="indirect">
<org type="institution" xml:id="struct-300266" status="INCOMING">
<orgName>Université de Caen Basse-Normandie</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Caen</settlement>
<region type="region" nuts="2">Basse-Normandie</region>
</placeName>
<orgName type="university">Université de Caen Basse-Normandie</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:hal-01069909</idno>
<idno type="halId">hal-01069909</idno>
<idno type="halUri">https://hal.archives-ouvertes.fr/hal-01069909</idno>
<idno type="url">https://hal.archives-ouvertes.fr/hal-01069909</idno>
<date when="2011-12-12">2011-12-12</date>
<idno type="wicri:Area/Hal/Corpus">000122</idno>
<idno type="wicri:Area/Hal/Curation">000122</idno>
<idno type="wicri:Area/Hal/Checkpoint">000076</idno>
<idno type="wicri:Area/Main/Merge">000332</idno>
<idno type="wicri:Area/Main/Curation">000328</idno>
<idno type="wicri:Area/Main/Exploration">000328</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">The Book Structure Extraction Competition with the Resurgence software for part and chapter detection at Caen University</title>
<author>
<name sortKey="Giguet, Emmanuel" sort="Giguet, Emmanuel" uniqKey="Giguet E" first="Emmanuel" last="Giguet">Emmanuel Giguet</name>
<affiliation wicri:level="1">
<hal:affiliation type="researchteam" xml:id="struct-388300" status="VALID">
<orgName>Equipe Hultech - Laboratoire GREYC - UMR6072</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-150" type="direct"></relation>
<relation name="UMR6072" active="#struct-441569" type="indirect"></relation>
<relation active="#struct-300358" type="indirect"></relation>
<relation active="#struct-300266" type="indirect"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-150" type="direct">
<org type="laboratory" xml:id="struct-150" status="VALID">
<orgName>Groupe de Recherche en Informatique, Image, Automatique et Instrumentation de Caen</orgName>
<orgName type="acronym">GREYC</orgName>
<desc>
<address>
<addrLine>Boulevard du Maréchal Juin - 14050 CAEN Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.greyc.fr</ref>
</desc>
<listRelation>
<relation name="UMR6072" active="#struct-441569" type="direct"></relation>
<relation active="#struct-300358" type="direct"></relation>
<relation active="#struct-300266" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="UMR6072" active="#struct-441569" type="indirect">
<org type="institution" xml:id="struct-441569" status="VALID">
<idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300358" type="indirect">
<org type="institution" xml:id="struct-300358" status="VALID">
<orgName>Ecole Nationale Supérieure d'Ingénieurs de Caen</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300266" type="indirect">
<org type="institution" xml:id="struct-300266" status="INCOMING">
<orgName>Université de Caen Basse-Normandie</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Caen</settlement>
<region type="region" nuts="2">Basse-Normandie</region>
</placeName>
<orgName type="university">Université de Caen Basse-Normandie</orgName>
</affiliation>
</author>
<author>
<name sortKey="Lucas, Nadine" sort="Lucas, Nadine" uniqKey="Lucas N" first="Nadine" last="Lucas">Nadine Lucas</name>
<affiliation wicri:level="1">
<hal:affiliation type="researchteam" xml:id="struct-388300" status="VALID">
<orgName>Equipe Hultech - Laboratoire GREYC - UMR6072</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-150" type="direct"></relation>
<relation name="UMR6072" active="#struct-441569" type="indirect"></relation>
<relation active="#struct-300358" type="indirect"></relation>
<relation active="#struct-300266" type="indirect"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-150" type="direct">
<org type="laboratory" xml:id="struct-150" status="VALID">
<orgName>Groupe de Recherche en Informatique, Image, Automatique et Instrumentation de Caen</orgName>
<orgName type="acronym">GREYC</orgName>
<desc>
<address>
<addrLine>Boulevard du Maréchal Juin - 14050 CAEN Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.greyc.fr</ref>
</desc>
<listRelation>
<relation name="UMR6072" active="#struct-441569" type="direct"></relation>
<relation active="#struct-300358" type="direct"></relation>
<relation active="#struct-300266" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="UMR6072" active="#struct-441569" type="indirect">
<org type="institution" xml:id="struct-441569" status="VALID">
<idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300358" type="indirect">
<org type="institution" xml:id="struct-300358" status="VALID">
<orgName>Ecole Nationale Supérieure d'Ingénieurs de Caen</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300266" type="indirect">
<org type="institution" xml:id="struct-300266" status="INCOMING">
<orgName>Université de Caen Basse-Normandie</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Caen</settlement>
<region type="region" nuts="2">Basse-Normandie</region>
</placeName>
<orgName type="university">Université de Caen Basse-Normandie</orgName>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">The GREYC Island team participated in the Structure Extraction Competition, part of the INEX Book track, for the second time, using the Resurgence software. We used a minimal strategy based primarily on a top-down document representation with two levels, part and chapter. The main idea is to use a model describing the relationships between elements of the document structure. Boundaries between high-level units are detected, first parts and then chapters. The page level is also used. The periphery-center relationship is calculated on the entire document and reflected on each page. The strengths of the approach are that it deals with the entire document; it handles books without ToCs, and titles that are not represented in the ToC (e.g. preface); it does not depend on a lexicon, and is hence tolerant of OCR errors and language-independent; and it is simple and fast.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>France</li>
</country>
<region>
<li>Basse-Normandie</li>
</region>
<settlement>
<li>Caen</li>
</settlement>
<orgName>
<li>Université de Caen Basse-Normandie</li>
</orgName>
</list>
<tree>
<country name="France">
<region name="Basse-Normandie">
<name sortKey="Giguet, Emmanuel" sort="Giguet, Emmanuel" uniqKey="Giguet E" first="Emmanuel" last="Giguet">Emmanuel Giguet</name>
</region>
<name sortKey="Lucas, Nadine" sort="Lucas, Nadine" uniqKey="Lucas N" first="Nadine" last="Lucas">Nadine Lucas</name>
</country>
</tree>
</affiliations>
</record>
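
The record above can also be read outside Dilib. As shown on this page, it uses the `wicri:` and `hal:` prefixes without declaring them, so a namespace-aware XML parser may reject it. The following is a minimal sketch in Python, assuming the record text has no XML declaration and is wrapped in an element that declares those prefixes. The names `parse_record`, `author_names`, `PLACEHOLDER_NS` and `SAMPLE`, as well as the namespace URIs, are hypothetical and are not part of Dilib or HAL.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumption): the record uses the wicri: and hal:
# prefixes without declaring them, so these values exist only to allow parsing.
PLACEHOLDER_NS = {
    "wicri": "urn:example:wicri",
    "hal": "urn:example:hal",
}


def parse_record(record_xml):
    """Wrap the record in an element declaring the missing prefixes, then parse it."""
    decls = " ".join(f'xmlns:{p}="{u}"' for p, u in PLACEHOLDER_NS.items())
    wrapped = f"<wrapper {decls}>{record_xml}</wrapper>"
    return ET.fromstring(wrapped).find("record")


def author_names(record):
    """Return the distinct author names found in titleStmt, in document order."""
    names = []
    path = "./TEI/teiHeader/fileDesc/titleStmt/author/name"
    for name in record.findall(path):
        text = (name.text or "").strip()
        if text and text not in names:
            names.append(text)
    return names


# Shortened, hypothetical sample with the same shape as the record above.
SAMPLE = """<record><TEI><teiHeader><fileDesc><titleStmt>
<title xml:lang="en">Sample title</title>
<author><name first="Emmanuel" last="Giguet">Emmanuel Giguet</name>
<affiliation wicri:level="1"><hal:affiliation type="researchteam"/></affiliation></author>
<author><name first="Nadine" last="Lucas">Nadine Lucas</name></author>
</titleStmt></fileDesc></teiHeader></TEI></record>"""

if __name__ == "__main__":
    print(author_names(parse_record(SAMPLE)))
```

Restricting the path to `titleStmt` avoids counting each author twice, since the record repeats the same author entries under `sourceDesc`.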

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000328 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000328 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Hal:hal-01069909
   |texte=   The Book Structure Extraction Competition with the Resurgence software for part and chapter detection at Caen University
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024